Abstract


Consider two linearly ordered sets $A, B$ with $|A| = m$, $|B| = n$, and
$p$ parallel processors ($p \le m \le n$). The paper presents an adaptive
parallel merging algorithm by disjoint comparisons, which requires at
most $4\log_2 2m + 3m/p + 2\lceil m/p\rceil \lfloor \log_2 (n/m) \rfloor$ steps.


1.  Introduction

In  recent  years,  there  has  been  a  growing  interest  in  developing 
efficient  algorithms  for  parallel  processors,  and  the  present  paper 
continues  this  line  of  research. 

Consider two disjoint linearly ordered sets

$A = \{a_1 < a_2 < \dots < a_m\}$, $B = \{b_1 < b_2 < \dots < b_n\}$, $m \le n$,

which are subsets of a linearly ordered set $S$. The problem of merging $A$
and $B$ is to find the linear ordering of $A \cup B$. Throughout the paper we
will assume that $m \le n$, $mn > 1$ and that we have $p$ parallel processors,
$p \le m \le n$.
In  the  efficiency  evaluations  we  will  consider  only  the  comparisons  between 
elements of $A$ and elements of $B$. A step is a set of comparisons performed
at the same time by the $p$ parallel processors. As usual, $\lceil x \rceil$
will denote the smallest integer not less than $x$, $\lfloor x \rfloor$ will
denote the largest integer not greater than $x$, and $|V|$ will denote the
number of elements of a set $V$.
Throughout  the  paper,  the  logarithms  are  in  base  2.  We  will  assume  that  the 
given  sets  A,B  are  sequential  lists  in  the  memory  of  the  computer,  the 
elements of $A$ being located in the nodes whose addresses are $X+1, \dots, X+m$
and the elements of $B$ in the nodes whose addresses are $Y+1, \dots, Y+n$,
$X+m < Y$. For every $w \in A \cup B$ we keep in the node of $w$ a field for a
pointer variable $c(w)$ and a field called ADDRESS($w$) which contains the
original address of $w$ plus one, i.e., for $a_i \in A$, ADDRESS($a_i$) $= X+i+1$
and for $b_j \in B$, ADDRESS($b_j$) $= Y+j+1$. At the beginning, $c(w) = \infty$
for every $w \in A \cup B$. In comparison-interchange operations, when we move
$w$ we move its entire node, hence we can know later from ADDRESS($w$) what
its original address was. For an element $a \in A$ let $b_j$, $b_{j+1}$ be the
place in $B$ in which it must be inserted for obtaining the linear ordering of
$\{a\} \cup B$; by inserting a pointer from $a$ to its place in $B$ we mean that
we make $c(a)$ equal to the address of the smallest element of $B$ greater
than $a$, i.e., $c(a) = Y+j+1$.
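As a small illustration of the pointer convention above, the following Python sketch computes $c(a)$ for a single element with a binary search. The function name `place_pointer`, the use of a Python list for $B$, and the base address `Y = 100` are assumptions made only for this example. The value `Y + n + 1` returned for an element larger than all of $B$ is likewise an assumption, since the text does not treat that case.

```python
from bisect import bisect_right

def place_pointer(a, b, Y):
    """Address of the smallest element of b that is greater than a.

    b is a sorted list, and b[k] (0-based) is assumed to live at address
    Y + k + 1, matching the layout Y+1, ..., Y+n described in the text.
    """
    j = bisect_right(b, a)   # number of elements of b smaller than a
    return Y + j + 1         # address of b_{j+1}; Y + n + 1 if none (assumption)

b = [10, 20, 30, 40]
print(place_pointer(25, b, Y=100))   # 25 lies between b_2 = 20 and b_3 = 30
```

Here the result is the address of $b_3$, that is, $Y + 3$.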

A merging algorithm is called adaptive if the result of some set
of comparisons may be used to decide which comparisons will be
performed in further steps. A merging algorithm is called nonadaptive if
the comparisons to be performed during the algorithm are prescribed
initially and remain fixed irrespective of the result of any particular
step. Generally, two different assumptions can be made on the given
parallel processor. One assumption is that any memory word can be accessed
by any number of processors at the same time. The other assumption is
that a memory word can be accessed by only one processor at a time, i.e.,
the comparisons performed at the same time are disjoint. Based on the
above assumptions, we can have four kinds of parallel merging algorithms:

a)  nonadaptive  by  disjoint  comparisons; 

b)  nonadaptive  by  comparisons  not  necessarily  disjoint; 

c)  adaptive  by  disjoint  comparisons; 

d)  adaptive  by  comparisons  not  necessarily  disjoint. 
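
The disjointness condition can be stated operationally: within one step, no element may take part in more than one comparison. The short Python check below is only an illustration of that definition. The function name `is_disjoint_step` and the labelling of elements as tuples such as `("a", 3)` are assumptions of this sketch, not notation from the paper.

```python
def is_disjoint_step(comparisons):
    """Return True if no element occurs in more than one comparison of a step.

    comparisons is an iterable of pairs of hashable element labels.
    """
    seen = set()
    for x, y in comparisons:
        if x == y or x in seen or y in seen:
            return False
        seen.update((x, y))
    return True

# a_1:b_1 and a_2:b_2 may share a step under the disjoint model,
# whereas a_1:b_1 and a_1:b_2 both access a_1 and may not.
print(is_disjoint_step([(("a", 1), ("b", 1)), (("a", 2), ("b", 2))]))
print(is_disjoint_step([(("a", 1), ("b", 1)), (("a", 1), ("b", 2))]))
```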

The most efficient known algorithm for nonadaptive merging by
disjoint comparisons with one processor is the algorithm of Batcher [1]
(see merging networks in [2]). It works inductively as follows.

Batcher's algorithm. Let $A, B$ be the sets to be merged.

BAT 1. Merge (by comparison-interchanges in place)
$A_1 = \{a_1, a_3, \dots, a_{2\lceil m/2\rceil - 1}\}$ with
$B_1 = \{b_1, b_3, \dots, b_{2\lceil n/2\rceil - 1}\}$,
obtaining the sorted result
$C_1 = \{v_1, v_2, \dots, v_{\lceil m/2\rceil + \lceil n/2\rceil}\}$.

BAT 2. Merge $A_2 = \{a_2, a_4, \dots, a_{2\lfloor m/2\rfloor}\}$ with
$B_2 = \{b_2, b_4, \dots, b_{2\lfloor n/2\rfloor}\}$,
obtaining the sorted result
$C_2 = \{w_1, w_2, \dots, w_{\lfloor m/2\rfloor + \lfloor n/2\rfloor}\}$.

BAT 3. On the sequence

$C = \{v_1, w_1, v_2, w_2, \dots, v_{\lfloor m/2\rfloor + \lfloor n/2\rfloor}, w_{\lfloor m/2\rfloor + \lfloor n/2\rfloor}, v^{*}, v^{**}\}$,

perform comparison-interchange operations of the following pairs:

$w_1 : v_2,\; w_2 : v_3,\; \dots,\; w_{\lfloor m/2\rfloor + \lfloor n/2\rfloor} : v^{*}$

($v^{*} = v_{\lfloor m/2\rfloor + \lfloor n/2\rfloor + 1}$ if $m$ or $n$ is odd, and it does not exist otherwise;
$v^{**} = v_{\lfloor m/2\rfloor + \lfloor n/2\rfloor + 2}$ if $m$ and $n$ are odd, and it does not exist
otherwise). The result will be sorted. For a complete proof see Knuth [2]. This
algorithm requires at most $((m+n)/2)\log m + n$ steps. As proved by A. and
F. Yao [3], such an algorithm requires at least $(n/2)\log(m+1)$ steps.
Therefore, the algorithm of Batcher is asymptotically optimal. This algorithm
can be easily implemented as a nonadaptive merging algorithm by disjoint
comparisons with $p$ processors as follows. Perform BAT 1 using $\lceil p/2\rceil$
processors and BAT 2 using $\lfloor p/2\rfloor$ processors, in parallel. After this,
perform BAT 3 using $p$ processors.
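
The three steps BAT 1 to BAT 3 can be sketched as a short recursive Python function. This is a sequential illustration only: it builds new lists instead of interchanging in place, it does not model the assignment of processors, and the function name `odd_even_merge` and the base cases (an empty input, or one element on each side) are choices made for this sketch rather than details taken from the paper.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists following the BAT 1 / BAT 2 / BAT 3 scheme."""
    if not a:
        return list(b)
    if not b:
        return list(a)
    if len(a) == 1 and len(b) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]   # a single comparison

    v = odd_even_merge(a[0::2], b[0::2])   # BAT 1: a_1, a_3, ... with b_1, b_3, ...
    w = odd_even_merge(a[1::2], b[1::2])   # BAT 2: a_2, a_4, ... with b_2, b_4, ...

    # BAT 3: v_1 stays first, then compare-interchange the pairs w_i : v_{i+1}.
    out = [v[0]]
    for i, w_i in enumerate(w):
        if i + 1 < len(v):
            out.append(min(w_i, v[i + 1]))
            out.append(max(w_i, v[i + 1]))
        else:
            out.append(w_i)                # last w has no partner (m and n even)
    out.extend(v[len(w) + 1:])             # v** when m and n are both odd
    return out

print(odd_even_merge([1, 5, 9], [2, 3, 4]))
```

The pairs handled inside the loop touch pairwise different elements, which is why BAT 3 can be carried out by disjoint comparisons in parallel.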

A nonadaptive parallel merging algorithm by comparisons not
necessarily disjoint can be obtained from Batcher's algorithm by performing
in parallel the first $p$ required comparisons, then the next $p$ comparisons
and so on, without caring whether the comparisons are disjoint or not.

Adaptive parallel merging by comparisons not necessarily disjoint
was discussed in [4], which contains an algorithm requiring at most
$2\lceil \log(2m+1) \rceil + \lfloor 3m/p \rfloor + \lceil m/p \rceil \lfloor \log (n/m) \rfloor$ steps.


The purpose of this paper is to describe an adaptive parallel
merging algorithm by disjoint comparisons which requires at most
$4\log 2m + 3m/p + 2\lceil m/p\rceil \lfloor \log (n/m) \rfloor$ steps and is asymptotically optimal
whenever $p$ is a function growing slower than $m/(4\log 2m)$. Algorithms A, B, C
are auxiliary, and they are used in the Main Algorithm.

2.  Algorithm  A 

Consider the two ordered sets $A, B$ and the $p$ parallel processors.
Algorithm A is a modification of Batcher's algorithm which inserts for every
$a \in A$ ($b \in B$) a pointer to its place in $B$ (in $A$ respectively).
Algorithm A works inductively as follows.

Perform BAT 1 using $\lceil p/2 \rceil$ processors and for every $a \in A_1$
($b \in B_1$) make $c(a)$ ($c(b)$) equal to the address in $B$ (in $A$) of the
smallest element of $B \cap C_1$ ($A \cap C_1$) greater than $a$ ($b$ respectively).
Let us denote these values of $c(a)$, $c(b)$ by $\bar{c}(a)$, $\bar{c}(b)$.

Perform BAT 2 using $\lfloor p/2 \rfloor$ processors and assign the values
$\bar{c}(a)$, $\bar{c}(b)$, $a \in A_2$, $b \in B_2$, as above.

Now, in BAT 3, on the sequence

$C = \{v_1, w_1, v_2, w_2, \dots, v_{\lfloor m/2\rfloor + \lfloor n/2\rfloor}, w_{\lfloor m/2\rfloor + \lfloor n/2\rfloor}, v^{*}, v^{**}\}$

we perform the following comparison-interchange operations using $p$
processors:

$w_1 : v_2,\; w_2 : v_3,\; \dots,\; w_{\lfloor m/2\rfloor + \lfloor n/2\rfloor} : v^{*}$.

Consider some comparison $w_i : v_{i+1}$ and assume that the result of this
comparison-interchange is $w_i < v_{i+1}$ (if $v_{i+1} < w_i$ the treatment will be
similar). Assume that $v_{i+1} \in A$ (the treatment is similar when $v_{i+1} \in B$).

If $w_i \in B$, then we put $c(w_i) = $ ADDRESS($v_{i+1}$) $- 1$. If $w_i \in A$, then let
$v_j$ be the element of $B$ whose address is $\bar{c}(v_{i+1})$. Since
$v_{i+1} < v_{i+2} < \dots < v_j$, it follows that $v_{i+1}, v_{i+2}, \dots, v_{j-1} \in A$.
Also, for every $v_k$, $k < i+1$, the pair containing $v_k$ is at the left of the
pair containing $w_i$, and hence $v_k < w_i$. Thus $\bar{c}(v_{i+1})$ is the address
of the smallest element of $B \cap C_1$ greater than $w_i$. Therefore,
$c(w_i) = \min(\bar{c}(w_i), \bar{c}(v_{i+1}))$ is the address of the smallest
element of $B \cap C$ greater than $w_i$.

Let us now calculate $c(v_{i+1})$. If $w_{i+1} \in B$ then
$c(v_{i+1}) = \min(\bar{c}(v_{i+1}), \text{ADDRESS}(w_{i+1}) - 1)$, and if $w_{i+1} \in A$, then
$c(v_{i+1}) = \min(\bar{c}(v_{i+1}), \bar{c}(w_{i+1}))$. Now, using the ADDRESS of
every node we return every $w \in C$ to its original place in $A$ or $B$.

When $m = n = p$, Algorithm A requires $\lceil \log 2m \rceil$ steps.

3.  Algorithm  B 

Consider the two ordered sets $A, B$ and assume that we have $m$ parallel
processors. Algorithm B is an adaptive algorithm which inserts for every
$a \in A$ ($b \in B$) a pointer to its place in $B$ (in $A$ respectively). (In fact,
it is a merging algorithm.) It works as follows. Denote
$r = \lceil n/(m+1) \rceil$ and let

$\bar{B} = \{b_r, b_{2r}, \dots, b_{mr}\}$, $|\bar{B}| = m$, $\bar{B} \subset B$.

First, using the $m$ processors, we perform on $A$ and $\bar{B}$ the Algorithm A, in
$\lceil \log 2m \rceil$ steps. The set $\bar{B}$ divides $B$ in $m+1$ intervals
$B_1, B_2, \dots, B_{m+1}$, each of length $\lfloor n/(m+1) \rfloor$. For every interval
$B_i$ defined by $b_{(i-1)r}$ and $b_{ir}$, the pointers $c(b_{(i-1)r})$ and $c(b_{ir})$
define on $A$ an interval $A_i$ starting with the successor of the element
pointed by $c(b_{(i-1)r})$ and ending with the element pointed by $c(b_{ir})$. Now,
to every interval $B_i$ we assign $|A_i|$ processors and for every
$1 \le i \le m+1$ we perform Algorithm B on $A_i$ and $B_i$.

Let $M(n,m,m)$ be the number of steps required by Algorithm B. Hence,

$M(n,m,m) = \lceil \log 2m \rceil + \max_{1 \le j \le m} M(\lfloor n/(m+1) \rfloor, j, j)$.

Let us prove by induction on $m$ and $n$ that $M(n,m,m) \le 2\log n$. Clearly,
$M(n,1,1) = \lceil \log(n+1) \rceil$ and $M(m,m,m) = \lceil \log 2m \rceil$. Assume that
the relation is true for less than $m$ processors or less than $n$ elements.
Then

$M(n,m,m) = \lceil \log 2m \rceil + \max_{1 \le j \le m} M(\lfloor n/(m+1) \rfloor, j, j)$
$\le \lceil \log 2m \rceil + 2\log \lfloor n/(m+1) \rfloor \le 2\log n$.
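
The last inequality of the induction is stated without its intermediate step. One way to fill it in, using only $\lceil x \rceil \le x + 1$ and $\lfloor x \rfloor \le x$ (this expansion is a reconstruction, not taken from the paper), is:

```latex
\begin{align*}
\lceil \log 2m \rceil + 2\log \lfloor n/(m+1) \rfloor
  &\le (\log 2m + 1) + 2\log \frac{n}{m+1} \\
  &= \log 4m + 2\log n - \log (m+1)^2 \\
  &\le 2\log n ,
\end{align*}
```

where the final line holds because $4m \le (m+1)^2$, which is equivalent to $(m-1)^2 \ge 0$.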

4.  Algorithm  C 

Algorithm C (based on an observation of Valiant [5]) uses two
parallel processors to perform on $A, B$ the following algorithm:

The first processor performs $x$ comparisons starting with $a_1$, $b_1$.
In some stage it compares $a_i$, $b_j$. Then: if $a_i < b_j$, it makes $c(a_i) = Y + j$
and continues by comparing $a_{i+1}$, $b_j$; if $a_i > b_j$, it makes $c(b_j) = X + i$ and
continues by comparing $a_i$, $b_{j+1}$. After the first processor finishes, the
second processor performs $m+n-1-x$ comparisons starting with $a_m$, $b_n$. In some
stage it compares $a_k$, $b_r$. Then: if $a_k < b_r$, it makes $c(b_r) = X+k+1$ and
continues by comparing $a_k$, $b_{r-1}$; if $a_k > b_r$, it makes $c(a_k) = Y+r+1$ and
continues by comparing $a_{k-1}$, $b_r$.

It is easy to see that after both processors are finished, there is
a pointer from every element of $A$ to its place in $B$ and from every element
of $B$ to its place in $A$.
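
The two scans can be illustrated with the following Python sketch. It is not a faithful transcription of Algorithm C. It stores 1-based positions instead of the addresses $X+i$, $Y+j$, it runs the two scans one after the other in a single thread, and the second scan simply continues until every element missed by the first scan has a pointer. That stopping rule and the treatment of an exhausted list are assumptions of this sketch; the paper instead counts $m+n-1-x$ comparisons. The name `algorithm_c_sketch` is likewise hypothetical.

```python
def algorithm_c_sketch(a, b, x):
    """Two-ended pointer scan on disjoint sorted lists a and b.

    Returns (ca, cb) with 1-based positions:
      ca[i] = position in b of the smallest element greater than a[i],
      cb[j] = position in a of the smallest element greater than b[j],
    using len(b) + 1 or len(a) + 1 when no such element exists.
    """
    m, n = len(a), len(b)
    ca, cb = [None] * m, [None] * n

    # First scan: at most x comparisons, starting with the smallest elements.
    i = j = 0
    for _ in range(x):
        if i == m or j == n:
            break
        if a[i] < b[j]:
            ca[i] = j + 1        # b[j] is the smallest element of b above a[i]
            i += 1
        else:
            cb[j] = i + 1        # a[i] is the smallest element of a above b[j]
            j += 1

    # Second scan: starts with the biggest elements (assumed stopping rule:
    # continue until everything not reached by the first scan has a pointer).
    k, r = m - 1, n - 1
    while k >= i or r >= j:
        if r < 0 or (k >= 0 and a[k] > b[r]):
            ca[k] = r + 2        # position of the element just above b[r]
            k -= 1
        else:
            cb[r] = k + 2        # position of the element just above a[k]
            r -= 1
    return ca, cb

print(algorithm_c_sketch([1, 4, 6], [2, 3, 5, 7, 8], x=3))
```

Whatever value of `x` is chosen, the resulting pointers are the same; only the split of work between the two scans changes.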

5.  The  Main  Algorithm 

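Let us now describe the algorithm for adaptive parallel merging of
$A$ and $B$ by disjoint comparisons using $p$ processors. Denote
$t = \lfloor \log (n/m) \rfloor$, $u = 2^t$, $v = \lfloor n/u \rfloor$, $s = \lceil m/p \rceil$, and let

$\bar{B} = \{b_u, b_{2u}, \dots, b_{vu}\}$, $|\bar{B}| = v$,

$\bar{B}_1 = \{b_{u\lceil v/p\rceil}, b_{2u\lceil v/p\rceil}, \dots, b_{(p-1)u\lceil v/p\rceil}\}$, $|\bar{B}_1| = p - 1$,

$\bar{A}_1 = \{a_s, a_{2s}, \dots, a_{(p-1)s}\}$, $|\bar{A}_1| = p - 1$.

It is easy to see that $m \le v \le 2m$, $n/(2m) \le u \le n/m$, $\lceil v/p \rceil \le \lceil 2m/p \rceil$,
$\bar{B}_1 \subset \bar{B} \subset B$ and $\bar{A}_1 \subset A$.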
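
The parameters $t$, $u$, $v$, $s$ and the stated inequalities are easy to check numerically. The Python sketch below computes them with integer arithmetic only; the function name `main_parameters` and the sample values $m = 8$, $n = 100$, $p = 4$ are assumptions chosen for illustration.

```python
def main_parameters(m, n, p):
    """Return (t, u, v, s) as defined for the Main Algorithm."""
    assert 1 <= p <= m <= n
    t = (n // m).bit_length() - 1   # floor(log2(n/m)), computed without floats
    u = 1 << t                      # u = 2^t
    v = n // u                      # v = floor(n/u)
    s = -(-m // p)                  # s = ceil(m/p)
    return t, u, v, s

m, n, p = 8, 100, 4
t, u, v, s = main_parameters(m, n, p)
print(t, u, v, s)                          # 3 8 12 2

# The inequalities claimed in the text, written with integers.
assert m <= v <= 2 * m
assert u * m <= n < 2 * u * m              # n/(2m) < u <= n/m
assert -(-v // p) <= -(-2 * m // p)        # ceil(v/p) <= ceil(2m/p)

# Positions (1-based) of the elements b_u, b_2u, ..., b_vu inside B.
print(list(range(u, v * u + 1, u)))
```

For these sample values every eighth element of $B$ is selected, giving $v = 12$ elements, which lies between $m = 8$ and $2m = 16$ as claimed.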

Our first task is to find the place of every element of $A$ in $\bar{B}$.
In the description, we will refer to $\bar{B}$ independently of $B$, but this part
of the algorithm can be performed while the elements of $\bar{B}$ remain in their
places in $B$ (without recopying $\bar{B}$ out of $B$). This part works as follows.

Using $p-1$ processors, we perform Algorithm B on $\bar{B}_1$ and $A$, and after
this on $\bar{A}_1$ and $\bar{B}$, in $4\log 2m$ steps. The elements $w$ of
$\bar{A}_1 \cup \bar{B}_1$ and their pointers $c(w)$ define two families of successive
disjoint intervals $\{A_1, A_2, \dots, A_{2p-1}\}$ on $A$ and
$\{B_1, B_2, \dots, B_{2p-1}\}$ on $\bar{B}$. Clearly, to find the place in $\bar{B}$ of an
element of $A_i$ it is enough to find its place in $B_i$.


The elements of $\bar{B}_1$ divide $\bar{B}$ in $p$ segments $B^1, \dots, B^p$, each of
length at most $\lceil v/p \rceil$, and the elements of $\bar{A}_1$ divide $A$ in $p$
segments $A^1, \dots, A^p$, each of length at most $\lceil m/p \rceil$. For every
$A_i$, $B_i$ there is a unique $j$ such that $A_i \subset A^j$ and a unique $k$ such
that $B_i \subset B^k$. For every $1 \le j \le p$, let us assign the $j$-th processor
to $A^j$. The segment $A^j$ is defined by $a_{(j-1)s}$ and $a_{js}$. From the pointers
$c(a_{(j-1)s})$ and $c(a_{js})$ we can find the elements of $\bar{B}_1$ between
$c(a_{(j-1)s})$ and $c(a_{js})$. From the pointers of these elements of $\bar{B}_1$ and
$a_{(j-1)s}$, $a_{js}$ we can find the intervals $A_i, A_{i+1}, \dots, A_{i+k}$ contained
in $A^j$ and their correspondents $B_i, B_{i+1}, \dots, B_{i+k}$. Now, using the $j$-th
processor we perform sequentially $|A_{i+r}|$ operations as in Algorithm C on
every pair $A_{i+r}$, $B_{i+r}$, $1 \le r \le k$, starting with the smallest elements.
We perform this in parallel on all the segments $A^j$, $1 \le j \le p$. After this,
for every $1 \le j \le p$ we assign the $j$-th processor to $B^j$. We find the
intervals $B_q, B_{q+1}, \dots, B_{q+h}$ contained in $B^j$ and their correspondents
$A_q, A_{q+1}, \dots, A_{q+h}$ as above. Now, using the $j$-th processor we perform
sequentially $|B_{q+r}| - 1$ operations as in Algorithm C on every pair
$A_{q+r}$, $B_{q+r}$, $1 \le r \le h$, starting with the biggest elements. We perform
this in parallel on all the segments $B^j$, $1 \le j \le p$.

In this way, we perform in fact Algorithm C on every pair $A_i$, $B_i$,
$1 \le i \le 2p - 1$, and hence for every element of $A$ we obtain a pointer to its
place in $\bar{B}$. In the above process the parallel comparisons are disjoint and
every processor performs at most $\lceil v/p \rceil + \lceil m/p \rceil - 2 \le 3m/p$ comparisons.

The elements of $\bar{B}$ divide $B - \bar{B}$ in $v + 1$ intervals, each of length
$u - 1$. Thus, for merging $A$ and $B$ we have to insert every element $a$ of $A$ in
the interval of $B$ between the addresses $c(a) - u$ and $c(a)$. We do this in
the following way.

We merge the first $p$ elements of $A$ with $B$ as follows. We consider
in parallel the first $p$ intervals of $B$ defined by $\bar{B}$, and using $c(b_{iu})$,
$c(b_{(i+1)u})$, $1 \le i \le p$, we find the segment (and its length) among
$\{a_1, \dots, a_p\}$ which must be inserted in every interval. We do the same
on the next $p$ intervals of $B$, and so on, until we arrive at the interval
in which $a_p$ must be inserted. Then, we assign to every interval of the
above stage a number of processors equal to the length of the segment to be
inserted in this interval, and we perform Algorithm B on every interval and
its corresponding segment.

After  this,  we  merge  the  next  p  elements  of  A  and  the  corresponding 
intervals  of  B,  this  time  starting  with  the  last  interval  used  in  the 
previous  stage,  and  so  on. 

Since $|A| = m$, this process will require
$2\lceil m/p \rceil \log u = 2\lceil m/p \rceil \lfloor \log (n/m) \rfloor$ steps.

At this stage, for every $a \in A$ we have a pointer $c(a)$ from $a$ to its
place in $B$. It remains to adjust the pointers ADDRESS for obtaining a
linked linear ordering of $A \cup B$, as follows. For every $1 \le i \le \lfloor m/2 \rfloor$ we
check whether $c(a_{2i-1}) = c(a_{2i})$ and remember it. Then, for every
$1 \le i \le \lfloor m/2 \rfloor$ we check whether $c(a_{2i}) = c(a_{2i+1})$ and remember this too.
Now, for every $1 \le i \le m$, if $c(a_i) \ne c(a_{i+1})$, we put ADDRESS($a_i$) $= c(a_i)$
and ADDRESS($b_{j-1}$) $= X + i + 1$, where $b_j$ is the element of $B$ whose address
is $c(a_{i+1})$. Also, if $c(a_1)$ points to $b_{i+1}$, we put ADDRESS($b_i$) $= X + 1$.

The number of steps required by the entire Main Algorithm is

$4\log 2m + 3m/p + 2\lceil m/p \rceil \lfloor \log (n/m) \rfloor$.

Since any adaptive parallel merging algorithm requires at least
$(m/p)\log (n/m) + m/p$ steps ([4]), the above algorithm is asymptotically
optimal whenever $p$ is a function growing slower than $m/(4\log 2m)$.
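
To get a feeling for how the upper bound compares with the lower bound of [4], both expressions can be evaluated for concrete sizes. The Python sketch below does only this arithmetic; the function names and the sample values $m = 2^{10}$, $n = 2^{20}$, $p = 8$ are assumptions chosen for illustration, and the printed ratio says nothing about other parameter ranges.

```python
from math import log2

def upper_bound_steps(m, n, p):
    """4 log 2m + 3m/p + 2 ceil(m/p) floor(log(n/m)), logarithms in base 2."""
    ceil_m_over_p = -(-m // p)
    floor_log = (n // m).bit_length() - 1      # floor(log2(n/m)) for n >= m
    return 4 * log2(2 * m) + 3 * m / p + 2 * ceil_m_over_p * floor_log

def lower_bound_steps(m, n, p):
    """(m/p) log(n/m) + m/p, the lower bound quoted from [4]."""
    return (m / p) * log2(n / m) + m / p

m, n, p = 1024, 1024 * 1024, 8
up = upper_bound_steps(m, n, p)    # 44 + 384 + 2560
low = lower_bound_steps(m, n, p)   # 1280 + 128
print(up, low, up / low)
```

For these values the $4\log 2m$ term contributes 44 steps out of 2988, which illustrates why the bound is dominated by the $m/p$ terms when $p$ is small compared with $m/(4\log 2m)$.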